Computational and Structural Biotechnology Journal — Latest Matching Preprints

1

Smart AI-Powered Machine Learning Risk Assessment for Early Osteoporosis Detection for Women Bone Health

Monfared, V.

2026-06-02 orthopedics 10.64898/2026.05.31.26354550 medRxiv

Top 0.1%

19.6%

Osteoporosis is often called a silent disease because it progresses without symptoms until a fracture occurs, posing a serious, yet frequently overlooked, threat to women's health. In response to the pressing need for early detection, we introduce OsteoInsight, an intelligent, AI-powered web application designed to assess osteoporosis risk with both clinical accuracy and interpretability. Built on a Random Forest classifier trained on over 2,000 women's health records, our model incorporates a wide range of domain-informed features, including hormonal history, lifestyle factors, reproductive health, and conditions affecting bone health. Despite an imbalanced dataset, with around 75% of cases being osteoporosis-positive, the model achieved encouraging results: 71.81% accuracy, an F1-score of 0.79, and an AUC-ROC of 0.78. SHAP analysis highlighted age, BMI, and menstrual history as key predictors, offering transparent insights into the model's reasoning. Additional contributors like fracture history, signs of low estrogen, and lactation duration were also found to be significant, enriching the interpretability of predictions. These insights are seamlessly integrated into the OsteoInsight user interface, making risk assessments not only accessible but also understandable for both clinicians and users. Our findings underscore the potential of AI-driven tools to enhance early screening and enable personalized risk profiling, empowering women and healthcare providers to take proactive steps in osteoporosis prevention.

2

A novel SXXLF motif in the FXR N-terminal domain mediates coregulator and interdomain interactions

Villalona, P.; Pulahinge, T.; Yu, T.; Wenning, J.; Frisbie, C. J.; Magafas, J.; Okafor, C. D.

2026-05-20 biochemistry 10.64898/2026.05.18.724725 medRxiv

Top 0.1%

14.2%

The nuclear receptor superfamily comprises ligand-regulated transcription factors that contain an intrinsically disordered domain at the amino-terminal end, known as the N-terminal domain (NTD). While this poorly conserved domain is known to possess ligand-independent activation function (AF-1), few NTD functions are conserved between nuclear receptors (NRs). Receptors in which NTD roles have been identified include the androgen receptor (AR), estrogen receptor (ER) and mineralocorticoid receptor (MR). Here, we aim to define the function of the NTD of the farnesoid X receptor (FXR), a crucial regulator of lipid and bile acid metabolism. We show that the NTD engages in interdomain contact with other FXR domains. We also observe that the NTD interacts directly with coregulator proteins. Using mutagenesis, mammalian two-hybrid assays and molecular dynamics simulations, we identify and validate a novel SXXLF motif in the NTD which mediates interactions with both coregulators and the ligand binding domain. Mutation of the motif induces large changes in conformational and allosteric coupling in FXR. Our study identifies a new nuclear receptor-interacting motif that modulates the transcriptional activity of FXR. Graphical Abstract: FXR-NTD regulates transcriptional activity through interdomain communication with the LBD and is also involved in co-activator recruitment. The SENLF motif is the first defined functional element within the FXR-NTD and mediates both NTD-LBD interaction and selective co-activator engagements to drive NTD-mediated transcriptional activity.

3

Learning from Drops: AI-Guided Integration of Liquid Biopsy Features in Cancer Studies

Andueza, M.; Villoslada-Blanco, P.; De Dreuille, B.; Alonso, L.; Sabroso-Lasa, S.; Pantel, K.; Alix-Panabieres, C.; Lopez de Maturana, E.; Malats, N.

2026-05-17 bioinformatics 10.64898/2026.05.12.724535 medRxiv

Top 0.1%

14.2%

Cancer is a major global health issue with rising incidence and mortality. Early detection, tumor characterization, and disease surveillance are crucial for timely and effective treatment, ultimately reducing mortality rates. Liquid biopsy (LB) has emerged as a valuable detection tool offering a non-invasive method to determine tumor-derived biomarkers in body fluids with demonstrated translational potential. To increase biomarker sensitivity, high-throughput sequencing platforms deliver massive volumes of data. Artificial Intelligence (AI) is pivotal in enabling the integration of large, complex datasets. This contribution aims to assess the current state of integrative AI-based research in the LB field and provide methodological guidance. First, we conducted a PubMed search and found that the literature is sparse in studies integrating LB features, particularly by applying AI. When adopting the latter approach, defining the study objectives is crucial to guide the subsequent methodological aspects, including study design, patient selection criteria, sample size, nature of the LB features, and metadata to collect. Specifically, we propose strategies and tools for data preprocessing, including normalization and batch correction, as well as handling outliers and missing data. Furthermore, we recommend various Machine/Deep Learning approaches for feature selection techniques to ensure model robustness, and we highlight the importance of undergoing rigorous internal and external validations of the selected models. Assessing clinical utility and interpretability is often overlooked but fundamental for real-world implementation. In conclusion, we provide the LB scientific community with AI-based methodological guidance to bridge the two fields and enhance the integrative analysis of LB features. Graphical abstract: Workchart for multiomics integrative studies in the liquid biopsy field.
Note: CTCs, circulating tumor cells; ctDNA, circulating tumor DNA; TEPs, tumor-educated platelets; miRNA, microRNA; cfRNAs, cell-free RNAs.

4

RT-nested and interfering-Primer PCR reveal prevalent isoform-specific A-to-I RNA editing in neuronal genes

Wang, Z.; Ni, Y.; Cai, W.; Li, H.; Duan, Y.

2026-05-17 molecular biology 10.64898/2026.05.15.725286 medRxiv

Top 0.1%

12.9%

Background: Metazoan adenosine-to-inosine (A-to-I) mRNA editing temporospatially diversifies the neuronal transcriptome and proteome. The limited read length from next-generation sequencing (NGS) constrains the quantification of the potentially differential editing levels across different splicing isoforms, restricting our understanding of the extent to which RNA editing contributes to molecular diversity and its interplay with splicing. Methods: We employed reverse transcription nested PCR (RT-nPCR) and developed a novel interfering-Primer PCR (iPrimer PCR) technique to distinguish different transcripts of any gene. We selected multiple essential genes exhibiting RNA editing in coding sequences (CDSs) or untranslated regions (UTRs) for isoform-specific amplification and Sanger sequencing. Results: Nine different Adar isoforms together with pre-mRNA had distinct editing levels at the S>G auto-recoding site, which was predicted to have isoform-specific effects on catalytic activities. Although pre-mRNA editing might exert isoform-dependent promotion/suppression of splicing, closely located editing sites, such as those in neuronal genes qvr and stj, still exhibited high correlation in editing levels due to co-editing. The iPrimer strategy further revealed differential recoding levels between the long/short 3′UTR isoforms of gene jef. Conclusions: We provide the first comprehensive solution for isoform-specific PCR amplification of any gene, enabling quantification of the RNA editing level of different isoforms. Our results offer insights into how RNA editing interplays with splicing, and highlight its complicated role in expanding molecular diversity.
Graphical abstract: We developed isoform-specific PCR followed by Sanger sequencing, and quantified differential RNA editing levels in different transcripts of a gene.

5

Non-invasive Transcriptomic Cell Profiling of the Human Endometrium with Generative Deep Learning

Meltsov, A.; Falcon-Perez, J. M.; Matorras, R.; Apostolov, A.; Sola-Leyva, A.; Esteki, M. Z.; Salumets, A.; Aleksejeva-Zagura, E.

2026-05-20 obstetrics and gynecology 10.64898/2026.05.18.26352867 medRxiv

Top 0.1%

12.2%

Background: Delineating the cellular origins of extracellular vesicles (EVs) enables the detection of clinically relevant changes in dynamic and complex tissues, such as the endometrium, which are not characterizable through single biomarker assays. Transcriptome deconvolution into cellular composition using deep learning methods provides a means to explore this complexity. However, such computational methods have not been previously applied to EV bulk transcriptomes, and their efficacy in profiling EV population changes and concordance with tissue throughout the menstrual cycle remains unknown. Methods: This observational cross-sectional study utilized a deconvolutional generative deep learning algorithm, BulkTrajBlend, trained on a comprehensive human endometrial single-cell RNA sequencing (scRNA-seq) atlas. The model was applied to deconvolve paired bulk transcriptomes from endometrial tissue and uterine fluid EVs (UF-EVs) across the proliferative (P, n=4), early-secretory (ES, n=5), mid-secretory (MS, n=5), and late-secretory (LS, n=5) phases from healthy, fertile women. To validate generalizability, independent UF-EV datasets (ES, n=12; MS, n=12) obtained via different laboratory protocols were included. Deconvolved pseudo-single-cell (pSC) profiles from UF-EV data were subsequently integrated with Visium spatial transcriptomics slides of human endometrium (P, n=2; MS, n=4; ES, n=2). Results: We developed a foundation model-based approach utilizing self-supervised learning to determine the cellular origin of EVs from their transcriptomic profiles. By mapping the generated pSC profiles to spatial transcriptomic data, we evaluated spatial origins of EVs. The statistical analysis demonstrated that UF-EV transcriptome deconvolution reflects the dynamic changes in the cellular composition of endometrial tissue across the menstrual cycle phases.
The ability to distinguish accurately between proliferative and decidualizing menstrual cycle phases (ROC-AUC = 0.98) using the cellular profile of the deconvolved UF-EV transcriptome enables non-invasive profiling of endometrial tissue. Conclusions: Our findings indicate the feasibility of determining endometrial tissue cellular composition using UF-EV transcriptomics. This methodology enables refined, non-invasive endometrial testing, avoiding invasive biopsy procedures. Based on deconvolution results, we are able to correlate UF-EV content with tissue, and distinguish between menstrual cycle phases. These results build toward a multifactorial screening method for abnormalities within the endometrium.

6

Can synthetic data overcome the privacy and fidelity bottleneck in Pharmacometrics? A comparative benchmark using a daptomycin population pharmacokinetic model

Destere, A.; Lombardi, R.; Labriffe, M.; Benoist, C.; Marquet, P.; Lavrut, T.; Gerard, A.; Bouveyron, C.; Woillard, J.-B.

2026-06-02 pharmacology and therapeutics 10.64898/2026.05.30.26354512 medRxiv

Top 0.1%

10.3%

Introduction: The sharing of individual patient data is essential for advancing pharmacometrics but is strictly limited by privacy regulations (e.g., GDPR). While synthetic data generation offers a legally compliant alternative, its structural impact on complex nonlinear mixed-effects (NLME) modelling remains largely unexplored. This study aimed to benchmark five generative artificial intelligence algorithms by evaluating the balance between data privacy and the preservation of structural PK properties and clinical dosing guidance. Material & methods: A daptomycin two-compartment PopPK model was used to simulate a reference cohort of 500 patients. Five generative algorithms (Modified AVATAR, Gaussian Copula, Synthpop, TVAE, and CTGAN) produced 100 independent synthetic datasets each. A two-stage evaluation framework was applied: first, a statistical indistinguishability test based on logistic regression (AUC ROC) was used as a macroscopic pre-selection criterion to determine algorithm eligibility for NLME modelling and privacy risk assessment. Privacy risk was independently quantified using the Anonymeter framework (Singling Out and Linkability attacks). Eligible algorithms were further evaluated on PK parameter recovery bias and clinical dosing simulations. Results: Deep learning architectures (TVAE, CTGAN) were excluded at the pre-selection stage due to both biologically implausible covariate generation and high macroscopic detectability (mean AUC ROC = 0.837 and 0.986, respectively). Synthpop, AVATAR, and Gaussian Copula all passed the indistinguishability threshold (AUC ROC = 0.475 ± 0.033, 0.490 ± 0.013, and 0.619 ± 0.031, respectively) and proceeded to NLME evaluation. However, attack-based privacy assessment revealed that Synthpop carried an unacceptable singling-out risk (0.035), disqualifying it from privacy-preserving data sharing.
AVATAR and Gaussian Copula demonstrated acceptable privacy profiles (singling-out = 0.004 and 0.001; linkability = 0.010 and 0.003, respectively). At the structural level, Gaussian Copula injected stochastic noise inflating residual error (+157.0%) and V1 (+25.9%), blunting predicted Cmax and predisposing to empirical dose escalation and risk of toxicity. AVATAR acted as a smoothing filter, deflating V2 (-48.3%) and underestimating CL (-12.9%). Forward clinical simulations confirmed directionally opposed prediction errors: Gaussian Copula consistently underestimated Cmax across standard and renally impaired profiles (-14.5% and -16.0%, respectively), predisposing to empirical dose escalation, whereas AVATAR- and Synthpop-derived models overestimated Cmax and Cmin in the obese infected patient (+14.7% and +8.2%, respectively), compounding the accumulation risk already present in this profile. Conclusion: While no generative algorithm currently offers a perfect solution, AVATAR and Gaussian Copula represent the most viable candidates, being the only methods to satisfy both macroscopic indistinguishability and attack-based privacy criteria. These findings highlight the necessity of a structured, two-stage validation framework and suggest that, when coupled with therapeutic drug monitoring, synthetic datasets could significantly enhance multicentre collaboration while maintaining strict regulatory compliance.
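The logistic-regression indistinguishability test mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the implementation, so everything here is an assumption: the function names (`fit_logistic`, `auc`, `detectability_auc`), the 50/50 train/test split, the plain gradient-descent fit, and the toy Gaussian data are hypothetical. The idea is to label real rows 0 and synthetic rows 1, fit a classifier, and read a held-out AUC near 0.5 as "hard to tell apart" and an AUC near 1.0 as "easily detected".

```python
import math
import random


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent (no regularization)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def detectability_auc(real, synthetic, seed=0):
    """Held-out AUC of a real-vs-synthetic classifier (0 = real, 1 = synthetic)."""
    rows = [(r, 0) for r in real] + [(s, 1) for s in synthetic]
    random.Random(seed).shuffle(rows)
    half = len(rows) // 2
    train, test = rows[:half], rows[half:]
    w, b = fit_logistic([x for x, _ in train], [y for _, y in train])
    scores = [b + sum(wj * xj for wj, xj in zip(w, x)) for x, _ in test]
    return auc(scores, [y for _, y in test])


# Toy data only: two covariates per "patient", not the study's cohort.
rng = random.Random(42)
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
faithful = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
shifted = [[rng.gauss(3, 1), rng.gauss(0, 1)] for _ in range(200)]

auc_faithful = detectability_auc(real, faithful)  # expected to sit near 0.5
auc_shifted = detectability_auc(real, shifted)    # expected to sit near 1.0
print(round(auc_faithful, 3), round(auc_shifted, 3))
```

Because the AUC is computed on held-out rows, values slightly below 0.5 are possible when the two samples come from the same distribution, which is consistent with the sub-0.5 figures reported for Synthpop and AVATAR.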

7

DamageFormer: a damage-aware multimodal deep learning framework for DNA lesion identification from nanopore sequencing

Yang, Q.; Li, L.; Ma, Q.; Yin, R.

2026-05-18 genomics 10.64898/2026.05.14.725245 medRxiv

Top 0.2%

10.1%

Background: DNA lesions arise from endogenous metabolism and environmental exposure and are the major drivers of mutagenesis, aging, and cancer development. However, mapping DNA damage at nucleotide resolution remains a technically challenging task. Nanopore sequencing enables direct detection of chemical perturbations through alterations in ionic current signals. Despite this potential, existing computational approaches remain limited in their capacity to generalize across diverse lesion types and to effectively integrate nucleotide sequence context with raw signal information for accurate detection and localization. Results: We presented DamageFormer, a multimodal deep learning framework for detection and localization of DNA lesions using native nanopore sequencing data. Central to this framework is LesionBERT, a damage-aware genomic foundation model built upon DNABERT-2 and enhanced with lesion-focused reconstruction objectives to improve representation of chemically modified bases. DamageFormer integrated LesionBERT with a neural signal model through an adaptive gating mechanism, enabling dynamic weighting of sequence context and nanopore signal evidence. The model was trained using a joint objective that combines prediction, localization, and contrastive alignment losses to promote cross-modal coherence and spatial precision. On an oxidative DNA damage benchmark comprising paired sequence and signal data, DamageFormer achieved an AUROC of 0.99997 for lesion detection and a mean absolute localization error of 0.00439, consistently outperforming state-of-the-art baselines. Model interpretation analyses revealed context-dependent modality weighting that adapts to variation in signal quality and sequence ambiguity. The proposed framework further generalizes to chemically distinct guanine lesions not observed during the training process, demonstrating its robustness and transferability to unseen damage types.
Conclusions: Damage-aware biological language modeling combined with adaptive multimodal fusion enables accurate and interpretable identification of DNA lesions from nanopore sequencing data. This framework provides a scalable approach for characterizing genome-wide damage landscapes and illustrates how chemical DNA information can be systematically incorporated into genomic language models. The source code and pretrained models of this work are available at: https://github.com/UF-HOBIYin-Lab/DamageFormer.

8

In silico characterization of unique fungal modular rhodopsin expands the horizon of novel optobiological and biomedical applications

Kateriya, S.; Kumari, A.; Kumar, A.; Sharma, K.; Pati, S. R.; Mohanty, S.

2026-05-28 bioinformatics 10.64898/2026.05.25.727616 medRxiv

Top 0.2%

10.0%

Microbial modular rhodopsins, in which light-sensing rhodopsin domains are fused with effector modules, have emerged as promising tools for optogenetic regulation in algae and other systems. However, the diversity and potential regulatory roles of fungal modular rhodopsins remain largely unexplored. Here, we performed a comprehensive in-silico analysis to identify previously uncharacterized fungal modular rhodopsins that pair a conserved light-sensing core with diverse effector domains, including an RPEL motif, an NADP-binding Rossmann fold domain, an MCM (Mini-Chromosome Maintenance) domain, and GC-cAT (Carnitine O-Acetyltransferase) modules. In Aureobasidium pullulans, the representative modular rhodopsin (ApRh-RPEL) contains an RPEL motif associated with actin-related and transcriptional regulatory processes, suggesting a light-driven fungal signaling pathway involved in cellular and transcriptional regulation. Rhodopsins fused with NADP-binding Rossmann fold and MCM domains further indicate possible applications in light-programmable metabolic and cell-cycle signaling. Genome mining additionally revealed that A. pullulans harbours a diverse but underexplored array of biosynthetic gene clusters (BGCs), raising the intriguing possibility that light perception may regulate secondary metabolite pathways. Supporting this, multisource protein-protein interaction network analysis links ApRh-RPEL to enzymes involved in terpenoid and sphingolipid biosynthesis, indicating potential cross-talk between the light-sensing module and metabolic regulation. These findings outline a computationally derived model in which fungal modular rhodopsins (ApRh-RPEL) function as opto-synthetic regulators of biosynthetic processes. Structural predictions confirmed a conserved Schiff-base lysine and retinal-binding pocket, highlighting functional diversity across fungal rhodopsins.
Together, these findings expand the optogenetic toolkit and provide a framework for engineering light-driven signaling in fungi, with optobiological and biomedical applications.

9

Benchmarking long-context genome language models on biosynthetic gene clusters

Hirota, K.; Higashi, K.; Kurokawa, K.; Yamada, T.

2026-05-15 bioinformatics 10.64898/2026.05.12.724296 medRxiv

Top 0.2%

9.8%

Recent advances in language models for natural language processing have spread to the field of genomics, driving the development of genome language models (gLMs) to decipher genomic information. Cutting-edge long-context gLMs are promising approaches for understanding and designing biological complexity, but their evaluation remains underdeveloped. In this study, we introduce BGCs-Bench, a unified benchmark focused on biosynthetic gene clusters for assessing long-range genomic modeling on three downstream tasks: biosynthetic class prediction, taxonomic classification and coding sequence annotation. Using BGCs-Bench, we perform systematic and layer-wise evaluations of the embedding representations of long-context gLMs, demonstrating that layer selection is crucial for downstream task performance. In addition to the evaluation results, the logit lens analysis of autoregressive gLMs suggests that, in StripedHyena-based models, earlier layers encode biologically meaningful information from input DNA sequences while deeper layers optimize embeddings for sequence generation. These findings provide insights for more effective development and application of long-context gLMs.

10

AI-assisted improvement of Aspergillus oryzae β-galactosidase using an Ensemble of Protein Language Models

Trapote Fernandez, A.; Fernandez, A.; Mendez-Liter, J. A.; Prieto, A.; Barriuso, J.; Osorio, F. G.

2026-05-21 synthetic biology 10.64898/2026.05.20.726739 medRxiv

Top 0.2%

9.3%

β-galactosidases (BGs) are essential enzymes widely used in the food industry, particularly in the production of lactose-free products. Among them, the BG from Aspergillus oryzae is of industrial relevance due to its activity at acidic pH and moderate thermal tolerance. However, enhancing its catalytic performance remains a key challenge. Traditional enzyme engineering methods are time-consuming and resource-intensive, limiting their scalability. Recent advances in Artificial Intelligence (AI), particularly those based on Natural Language Processing, offer a promising alternative by enabling efficient exploration of protein sequence space and prediction of beneficial mutations. In this study, we introduce an ensemble-based, zero-shot Protein Language Model pipeline that reconciles predictions from six independent models (ESM2 and the five ESM1v variants) combined with a diversity-aware candidate selection strategy. Applied to the BG from A. oryzae, this approach identified beneficial mutations leading to novel enzyme variants with up to a four-fold increase in catalytic efficiency on oNPGal, a two-fold increase on lactose, and, independently, a T338I variant with markedly enhanced thermostability (≈80% residual activity after 24 h at 60 °C), all without requiring supervised fine-tuning on experimental fitness data. Our results demonstrate that consensus across an ensemble of PLMs can efficiently enrich beneficial substitutions in industrially relevant enzymes and substantially reduce the number of wet-lab candidates that need to be screened.

11

Design of a Multi-epitope Vaccine Against Human Glanders Targeting Outer Membrane β-barrel Proteins of Burkholderia mallei

Kapoor, J.; Panda, A.; Kumar, S.; Bandyopadhyay, A.

2026-05-28 bioinformatics 10.64898/2026.05.25.727591 medRxiv

Top 0.2%

9.3%

Burkholderia mallei, a facultative intracellular Gram-negative pathogen, is the causative agent of glanders, which primarily affects solipeds and is sporadically transmitted to humans. Current interventions mainly rely on antibiotics; however, increasing resistance and the lack of a licensed vaccine further complicate disease management. In the present study, a consensus-based computational framework was employed on the B. mallei turkey2 proteome. A total of 59 proteins, including porins, TonB receptors, autotransporters, and efflux components, were identified as surface-exposed outer membrane β-barrel (OMBB) proteins and used to design a multi-epitope vaccine (MEV) construct. B- and T-cell epitopes were predicted from these 59 proteins, and ten epitopes each of cytotoxic T-lymphocyte (CTL), helper T-lymphocyte (HTL), and B-cell were chosen based on their antigenicity, non-allergenicity, non-toxicity, surface accessibility, and conservation across 32 B. mallei strains. Suitable adjuvants were added at the N-terminus of the MEV to enhance its immunogenicity. The 780 amino acid MEV construct was predicted to be antigenic and soluble upon overexpression, with 62.69% random coils, while the rest formed α-helices and β-strands. The tertiary structure of the MEV was generated and subsequently validated, indicating good structural quality. Molecular docking of the MEV with toll-like receptor 4 (TLR4) demonstrated strong affinity, and molecular dynamics simulation confirmed the structural stability of the MEV-TLR4 complex. In-silico immune simulation showed the capability of the MEV to induce a strong immune response. The study proposes an MEV construct built from surface-exposed OMBB proteins, which directly interact with the host and serve as effective immunogenic targets against B. mallei infection.

12

Deep analysis of FANTOM CAGE data reveals hierarchical patterns of TSS co-deployment hubs and their disruption in cancers

Meduri, R.; Satish, A. L.; Singh, U.

2026-05-18 genomics 10.64898/2026.05.15.725323 medRxiv

Top 0.2%

9.3%

Selective deployment of multiple transcription start sites is a major regulatory feature of human transcriptomes. FANTOM CAGE data exhibit a near-universal TSS deployment parsimony which is disrupted in cancers. We have recently shown that TSS deployment is sensitive to gene function, futile upstream transcription, and cellular biosynthetic states. Patterns in FANTOM CAGE data can reveal mechanisms underlying TSS co-deployments. We propose and test the possibility that some TSSs act like epromoters, serving as co-varying hubs of transcriptional activity for multiple other promoters. Using deep analysis of CAGE data implemented through neural networks, we show that non-cancers implement transcription co-deployments through cores of epromoter-like TSSs which are generally proximal to their start codons. These TSSs show enhancer-like TFBS profiles. A comparison with cancer CAGE data shows that the concentrated epromoter core is disrupted in cancers, with multiple distal TSSs replacing the proximal TSS cores. We provide evidence that the core TSSs are rich in YY1 and CTCF binding sites and associated with genes coding for transcription factors. Our findings show that covariance of TSS deployment is sensitive to transcriptional resource cost and that a parsimonious design of TSS co-deployments depends on proximal TSSs in non-cancers, a mechanism grossly disrupted in cancers.
Highlights:
- Heterogeneous FANTOM CAGE data contain universal patterns of TSS co-deployments.
- TSS co-deployments exhibit a parsimonious "core-covariant" scheme which is disrupted in cancers.
- Core TSSs are enriched in transcription factor binding sites and gene functions which justify biological features of the samples.
- The DL pipeline we present identifies the core-covariant TSS sets in an unbiased manner.

13

Constrained Evolutionary Design of Matrixyl Analogs: Balancing Permeability and Functional Preservation Through Computational Optimization

Komianos, N.; Prakash, P.

2026-05-14 bioinformatics 10.64898/2026.05.12.724473 medRxiv

Top 0.2%

9.1%

Matrixyl (palmitoyl pentapeptide-4, KTTKS core) is a collagen-stimulating peptide used in topical anti-ageing products, but its in-use efficacy is limited by poor permeation through the stratum corneum. We describe a deterministic computational workflow that combines a tournament genetic algorithm and NSGA-II with exact RDKit molecular descriptors to search the fixed-length, edit-distance-2 neighbourhood of KTTKS (3,706 candidate sequences) for analogs with descriptors more favourable for passive transdermal diffusion. The search returns a 9-member Pareto frontier that quantifies the trade-off between predicted permeability and motif preservation. Five of the nine frontier members carry the same substitution, lysine to proline at position 4 (K4P). This single change lowers the topological polar surface area by 25.6%, removes the +1 charge contributed by lysine, and reduces the functional-preservation score from 1.00 (KTTKS) to 0.67. The frontier ranking is unchanged by ±30% perturbations to the TPSA and Mw penalty weights and by a 30% increase in the LogP penalty; only a 30% reduction in the LogP penalty produces rank movement. The frontier matches the ground-truth Pareto set obtained by exhaustive enumeration of all 3,706 candidates (precision and recall both 100%). On the basis of these results we recommend three sequences for experimental validation: PTTPS (largest predicted gain), KTTPS (single-mutation, conservative), and KTTPP (backup). All code, results, and figures are released under MIT and CC BY 4.0.
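The size of the search space can be checked with a short enumeration. This is a sketch under stated assumptions, not the authors' code: it assumes "fixed-length, edit-distance-2" means at most two substitutions over the 20 standard amino acids, with the parent sequence included, which gives 1 + 5·19 + 10·19² = 3,706. The `pareto_front` helper and the `toy_scores` values are hypothetical and only illustrate non-dominated filtering over two objectives that are both maximized.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumption)


def neighbourhood(parent, max_subs=2):
    """All same-length sequences within max_subs substitutions of parent."""
    found = {parent}
    for k in range(1, max_subs + 1):
        for positions in combinations(range(len(parent)), k):
            # At each chosen position, allow only residues that differ from the parent.
            choices = [[a for a in AMINO_ACIDS if a != parent[i]] for i in positions]
            for residues in product(*choices):
                seq = list(parent)
                for i, a in zip(positions, residues):
                    seq[i] = a
                found.add("".join(seq))
    return found


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives maximized)."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [p for p in points if not any(dominates(q, p) for q in points)]


candidates = neighbourhood("KTTKS")
print(len(candidates))  # 3706

# Hypothetical (preservation, permeability) pairs, for illustration only.
toy_scores = [(1.00, 0.20), (0.67, 0.80), (0.50, 0.50), (0.67, 0.90)]
print(pareto_front(toy_scores))  # [(1.0, 0.2), (0.67, 0.9)]
```

Under the same assumption, the three recommended sequences (PTTPS, KTTPS, KTTPP) all fall inside this neighbourhood, since each differs from KTTKS by one or two substitutions.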

14

Structural distance at the tRNA synthetase active site interface predicts pathogenicity but is captured by AlphaMissense and EVE except among score-ambiguous variants

Liebeskind, K.; Francklyn, C.; Barrantes Reynolds, R.

2026-05-26 bioinformatics 10.64898/2026.05.22.727252 medRxiv

Top 0.2%

8.5%

Variants of uncertain significance have accumulated as genomic sequencing has become more widespread, which complicates rare disease diagnosis and requires substantial resources for re-evaluation. Aminoacyl-tRNA synthetases (ARSs) are a protein family with extensive variant data and well-characterized disease associations, making them an ideal system for investigating the relationship between variant location and pathogenicity. Using structural distance measurements to the ARS-tRNA binding interface combined with existing pathogenicity predictors, AlphaMissense and EVE, we investigated whether explicit structural binding information could improve missense variant pathogenicity prediction. Pathogenic variants were found to cluster significantly closer to the tRNA-binding interface than benign variants (p = 0.0003). Incorporating explicit distance information into a Bayesian mixture model did not substantially improve predictive performance over AlphaMissense and EVE alone, suggesting that these models already implicitly capture relevant structural binding context. However, a clinically important subset of interface variants classified as ambiguous by both existing models identifies a specific gap where explicit structural distance information may provide added discriminative value, but the limited number of clinically validated variants currently available constrains the ability to fully evaluate this potential. Incorporating additional biologically relevant features not captured by existing models, such as protein stability or conformational dynamics, as well as refining structural distance calculations, may further improve classification of this subset. These findings highlight both the power and the limitations of existing pathogenicity predictors and suggest that structurally informed approaches targeting the binding interface represent a promising direction for improving classification of these ambiguous variants that have great clinical significance. 
Author Summary: Advances in clinical genetic sequencing have led to the identification of a growing number of genetic variants whose impact on human health is unknown. These "variants of uncertain significance" present a major challenge because their role in causing disease cannot yet be confirmed or ruled out. This study focuses on a specific family of essential enzymes called aminoacyl-tRNA synthetases, which play a critical role in protein translation. Mutations in these enzymes have been linked to a range of diseases. This project aims to provide a novel method for determining the pathogenicity of variants specifically in aminoacyl-tRNA synthetases. We propose that the physical proximity of a variant to the functional binding site of the enzyme is influential in determining pathogenicity. We find that this spatial relationship is a meaningful indicator of a variant's potential to disrupt normal function.

15

Putative G-Quadruplex Structures in Cancer-Dysregulated Circulating lncRNAs and their G4-mediated Identification of Protein Interacting Partners

Singh, D.; Ghosh, A.; Mathur, S.; Patra, S.; Nasir, S.; Hadiya, R.; Datta, B.

2026-05-29 biochemistry 10.64898/2026.05.27.728349 medRxiv

Top 0.3%

8.4%

Circulating long non-coding RNAs (lncRNAs) have emerged as compelling cancer biomarkers. However, the structural features that mediate their extracellular stability and protein interactions remain largely unexplored. Here, we present the first systematic investigation of G-quadruplex (G4) motifs within cancer-dysregulated circulating lncRNAs and exploit these structures as molecular handles to identify associated RNA-binding protein (RBP) networks. From 283 circulating lncRNAs curated from the Lnc2Cancer 3.0 database, putative G-quadruplex-forming sequences (PQSs) were identified computationally using QGRS Mapper and G4Hunter, yielding four prioritized candidates (AGAP2-AS1, LINC00683, DLG1-AS1, and KRTAP5-AS1) spanning 2G to 4G architectures. In vitro transcribed PQSs were validated for parallel G4 formation by circular dichroism spectroscopy, native polyacrylamide gel electrophoresis with thioflavin T staining, and reverse transcriptase stop assays, conducted under both standard buffer and simulated body fluid conditions to approximate the circulatory milieu. Electrophoretic mobility shift assays and isothermal titration calorimetry demonstrated nanomolar-affinity interactions between the G4-containing RNA and human serum albumin (HSA), the most abundant circulating protein. Cross-referencing G4-interacting proteins from the G4IPDB database with lncRNA-protein associations from LncTarD and NPInter, combined with RPISeq interaction predictions, identified ten candidate RBPs. A STRING-based protein-protein interaction (PPI) network was constructed at a confidence threshold of ≥0.7 and refined iteratively using experimental stability data to exclude proteins associated exclusively with the structurally weaker KRTAP5-AS1. The resulting network, centered on ELAVL1, IGF2BP1, hnRNPA2B1, and FUS, highlights a coordinated post-transcriptional regulatory module relevant to oncogenesis.
This work establishes a novel, experimentally validated framework wherein G4 motifs serve as entry points for decoding the protein interactome of circulating lncRNAs, with implications for cancer diagnostics and RNA-targeted therapeutic strategies.

16

A Multi-Epitope Vaccine Design for Human Pasteurellosis using Outer Membrane β-barrel Proteins of Pasteurella multocida

Panda, A.; Kapoor, J.; Kumar, S.; Bandyopadhyay, A.

2026-06-01 bioinformatics 10.64898/2026.05.28.728361 medRxiv

Top 0.3%

8.2%

Pasteurella multocida is a facultative anaerobic, Gram-negative coccobacillus that causes pasteurellosis in companion animals (cats and dogs), livestock, and poultry. Close contact with infected animals poses a significant zoonotic risk to humans through bite wounds, scratches, licking and transfer of bodily fluids. Current treatment relies mainly on antibiotics, and the lack of a licensed human vaccine further exacerbates the challenge. In the present study, a consensus-based computational approach was employed on the P. multocida Past 9 proteome. A total of 29 outer membrane β-barrel (OMBB) proteins, including TonB-dependent receptors, porins, autotransporters, adhesins and efflux pumps, were identified and used to design a multi-epitope vaccine (MEV) construct. B-cell and T-cell epitopes were predicted from the identified proteins. Ten epitopes each of cytotoxic T-lymphocyte (CTL) and helper T-lymphocyte (HTL), and three B-cell epitopes were selected based on their antigenicity, non-allergenicity, non-toxicity, surface accessibility, and conservation across eight P. multocida human-infecting strains. The MEV was supplemented with suitable adjuvants at the N-terminus to enhance its immunogenicity. The MEV construct, with a length of 459 amino acids, was predicted to be antigenic, non-allergenic, non-toxic and soluble upon expression. The MEV structural model was generated and subsequently validated, which indicated good structural quality. Molecular docking between MEV and human toll-like receptor 4 (TLR4) demonstrated strong binding affinity, and molecular dynamics simulation confirmed the structural stability of the MEV-TLR4 complex. Immune simulation of the MEV construct elicited a strong immune response. This study proposes a designed MEV candidate against human pasteurellosis and highlights OMBB proteins as potential immunogenic targets for vaccine development.

17

Assessing and Optimizing Low-Frequency Somatic Mutation Detection: A Multi-Platform High-Throughput Sequencing Perspective

Feng, B. N.; Lin, Y.; Liu, L.; Lin, Q.; Lin, Y.; Liu, Y.; Li, J.; Lei, C.; Chen, C.; Yang, M.; Peng, X.; Zhou, Z.; Yan, Q.; Sun, L.; Li, Q.

2026-06-01 bioinformatics 10.64898/2026.05.28.728367 medRxiv

Top 0.4%

6.7%

The availability of multiple commercial short-read sequencing platforms necessitates systematic cross-platform performance comparisons, particularly for challenging applications such as low-frequency somatic mutation detection. Here, a large-scale targeted sequencing dataset from five Genome in a Bottle (GIAB) human genomic DNA reference standards, HG001 to HG005, alongside Twist Biosciences cfDNA reference standards featuring 1% variant allele frequency (VAF), was generated on six platforms (NovaSeq 6000, NovaSeq X, FASTASeq 300, GenoLab M, SURFSeq 5000, and MGISEQ-T7). To build a realistic benchmark while keeping authentic sequencing backgrounds, we developed PosMix, a simulation tool that generates position-specific VAFs. To overcome the limitations of conventional variant callers (high recall with poor precision for VarScan2, higher precision with lower recall for Strelka2/Mutect2), we developed SomaticXGB, a machine learning-based caller. In this study, SURFSeq 5000 consistently exhibited the lowest error rates and achieved superior accuracy for VAFs as low as 0.5%, outperforming all other sequencing platforms. SomaticXGB, in turn, attained F1 scores of approximately 0.92 on simulated datasets with VAFs ranging from 0.5% to 1.5% and 0.89 on Twist 1% standards, substantially outperforming conventional methods. This work delivers a rich multi-platform data resource, a standardized pipeline for performance benchmarking, and a machine learning-based strategy for optimized somatic mutation detection.

18

MIMOSA: A model-independent framework for transcription factor binding site motif similarity assessment

Tsukanov, A. V.; Levitsky, V. G.

2026-05-17 bioinformatics 10.64898/2026.05.13.725009 medRxiv

Top 0.4%

6.7%

Motivation: Transcription factors (TFs) regulate gene expression by binding specific DNA sequences, which are commonly represented by motif models. Although position weight matrices (PWMs) remain the dominant motif representation, alternative models, such as Markov models, can capture interpositional dependencies and may provide higher predictive performance. However, existing motif comparison tools are designed mainly for PWMs or require motifs to be reduced to PWM/PPM representations. This creates a major bottleneck for comparing motifs represented by different model architectures. This limitation complicates the interpretation of de novo motif discovery results and hinders the systematic integration of diverse motif models into genomic analyses.
Results: We present MIMOSA (Model-Independent Motif Similarity Assessment), a model-independent framework for direct comparison of TF binding site (TFBS) motifs regardless of their mathematical representation. MIMOSA assesses motif similarity by comparing calibrated recognition profiles produced by motifs of different models on the same DNA sequence set, rather than by comparing the motifs themselves. In a cross-database benchmark on HOCOMOCO motifs, MIMOSA achieved retrieval performance comparable to established PWM-oriented tools, including Tomtom and MACRO-APE, with MRR and Recall@k close to the best-performing methods. Pairwise ranking comparisons showed that MIMOSA captures a similarity signal consistent with existing approaches while providing a representation-independent comparison strategy. Application to de novo motifs derived from ChIP-seq data for the ATF3 TF demonstrated that recognition-profile comparison distinguished alternative spacer variants represented as separate PWMs from their integration within more flexible models such as BaMM and Slim. Thus, MIMOSA enables quantitative cross-model motif comparison and supports interpretation of motif heterogeneity in TFBS analyses.
Availability and implementation: MIMOSA is implemented in Python and is freely available at https://github.com/ubercomrade/mimosa.

19

A Novel Network Approach to Identify Sample-Specific Context-Informed Metabolic Signatures During Developmental Processes

Lee, E.; Koppayi, A.; Veiga-Lopez, A.; Penalver Bernabe, B.

2026-05-22 bioinformatics 10.64898/2026.05.20.726642 medRxiv

Top 0.5%

6.6%

Metabolism plays an essential role in cellular processes: development, growth, differentiation, and determination of cell identity. Understanding how metabolic processes dynamically change across cell types, stages, and environmental conditions is crucial for studying developmental biology, aging, and disease progression. Genome-wide metabolic models (GEMs) are a powerful network-based tool for studying these processes by integrating omics data to model context-specific metabolism. However, current approaches, such as Flux Balance Analysis (FBA), have limitations in addressing the dynamic nature of metabolism across developmental stages at a sample-specific resolution. To address this, we introduce a novel network-based method for analyzing cell- and stage-specific metabolic flow using directed and weighted metabolic networks that account for sample-specific transcriptomic data. We apply this method to study ovarian follicle development, providing a deeper understanding of intracellular metabolic processes and identifying key metabolites, enzymes, and potential markers of follicular maturation, which are important for IVF. By incorporating biologically meaningful data, this approach bridges the gap between theoretical metabolic network models (GEMs) and experimental observations, offering a systems-level view of metabolic dynamics in developmental and understudied contexts.

20

De novo design of binder proteins targeting Helicobacter pylori adhesin BabA

Zhu, Y.; Isah, M. B.; Zhang, X.

2026-05-27 bioengineering 10.64898/2026.05.24.727452 medRxiv

Top 0.5%

6.5%

Helicobacter pylori has been classified as a Group 1 carcinogen by the International Agency for Research on Cancer of the World Health Organization and is one of the most well-established risk factors for gastric cancer. Long-term colonization by H. pylori depends on adhesin-mediated attachment to the gastric mucosa, among which the blood group antigen-binding adhesin BabA is a key surface factor involved in host recognition, tissue tropism, and persistent infection. In this study, we established a structure-guided computational design pipeline to develop compact protein binders targeting functionally relevant epitopes of BabA. First, using experimentally resolved BabA-antibody and BabA-nanobody complex structures as templates, we extracted structural contact residues on BabA through heavy-atom contact analysis, thereby defining antibody-recognition epitopes supported by complex-structure evidence. In addition, sequence-based, structure-based, and evolutionary conservation analyses were integrated to identify candidate functional epitope residues with high antigenicity, strong conservation, and surface-exposed features. On this basis, constrained de novo backbone generation was performed around the prioritized epitope regions, followed by amino acid sequence design and structural back-validation of the candidate binders. Candidate BabA-binder complexes were further evaluated using molecular docking, molecular dynamics simulations, and residue-level interface perturbation analysis to assess interface stability, epitope occupancy, and potential binding hotspots. This workflow enables systematic screening of BabA-targeting binders that may compete with antibody-recognized functional surfaces. 
Although these candidates still require experimental validation, this study provides a transferable computational framework for designing compact protein binders against pathogen adhesins by integrating experimentally resolved complex-structure resources with computational epitope prioritization based on sequence, conformation, and evolutionary conservation, and establishes a preliminary library of BabA candidate binders for subsequent validation and optimization.